Abstract: Building mathematical models of cellular networks lies at the core of 
systems biology. It involves, among other tasks, the reconstruction of the structure of 
interactions between molecular components, which is known as network inference or reverse 
engineering. Information theory can help in the goal of extracting as much information as 
possible from the available data. A large number of methods founded on these concepts 
have been proposed in the literature, not only in biology journals, but in a wide range of 
areas. Their critical comparison is difficult due to the different focuses and the adoption 
of different terminologies. Here we attempt to review some of the existing information 
theoretic methodologies for network inference, and clarify their differences. While some 
of these methods have achieved notable success, many challenges remain, among which we 
can mention dealing with incomplete measurements, noisy data, counterintuitive behaviour 
emerging from nonlinear relations or feedback loops, and computational burden of dealing 
with large data sets. 

Keywords: systems biology; network modeling; data-driven modeling; information theory; 
statistics; systems identification 
1. Introduction 

Systems biology is an interdisciplinary approach for understanding complex biological systems at 
the system level [1]. Integrative mathematical models, which represent the existing knowledge in a 
compact and unambiguous way, play a central role in this field. They facilitate the exchange and critical 
examination of this knowledge, allow testing whether a theory is applicable, and make quantitative predictions 
about the system's behaviour without having to carry out new experiments. In order to be predictive, 
models have to be "fed" (calibrated) by data. Although the conceptual foundations of systems biology 
had been laid several decades ago, during most of the 20th century the experimental data to support 
its models and hypotheses were missing [2]. With the development of high-throughput techniques in 
the 1990s, massive amounts of "omics" data were generated, providing the push required for the rapid 
expansion of the field. 

This review paper deals with the problem of constructing models of biological systems from 
experimental data. More specifically, we are interested in reverse engineering cellular systems that 
can be naturally modeled as biochemical networks. A network consists of a set of nodes and a set of 
links between them. In cellular networks the nodes are molecular entities such as genes, proteins, or 
metabolites. The links or edges are the interactions between nodes, such as the chemical reactions where 
the molecules are present, or a higher level abstraction such as a regulatory interaction involving several 
reactions. Thus cellular networks can be classified, according to the type of entities and interactions 
involved, as gene regulatory, metabolic, or protein signaling networks. 

The main goal of the methods studied here is to infer the network structure, that is, to deduce the set of 
interactions between nodes. This means that the focus is put on methods that — if we choose metabolism 
as an example — aim at finding which metabolites appear in the same reaction, as opposed to methods 
that aim at the detailed characterization of the reaction (determining its rate law and estimating the values 
of its kinetic constants). The latter is a related but different part of the inverse problem, and will not be 
considered here. 

Some attributes of the entities are measurable, such as the concentration of a metabolite or the 
expression level of a gene. When available, those data are used as the input for the inference procedure. 
For that purpose, attributes are considered random variables that can be analyzed with statistical tools. 
For example, dependencies between variables can be expressed by correlation measures. Information 
theory provides a rigorous theoretical framework for studying the relations between attributes. 

Information theory can be viewed as a branch of applied mathematics, or more specifically as a 
branch of probability theory [3], that deals with the quantitative study of information. The foundational 
moment of this discipline took place in 1948 with the publication by C.E. Shannon of the seminal paper 
"A mathematical theory of communication" [4]. Indeed, that title is a good definition of information 
theory. Originally developed for communication engineering applications, the use of information theory 
was soon extended to related fields such as electrical engineering, systems and control theory, computer 
science, and also to more distant disciplines like biology [5]. Nowadays the use of information-theoretic 
concepts is common in a wide range of scientific fields. 

The fundamental notion of information theory is entropy, which quantifies the uncertainty of a random 
variable and is used as a measure of information. Closely related to entropy is mutual information, 
a measure of the amount of information that one random variable provides about another. These 
concepts can be used to infer interactions between variables from experimental data, thus allowing 
reverse engineering of cellular networks. 

A number of surveys, which approach the network inference problem from different points of view, 
including information-theoretic and other methods, have been published in the past. To the best of 
the authors' knowledge, the first survey focused on identification of biological systems dates back to 
1978 [6]. More recently, one of the first reviews to be published in the "high-throughput data era" 
was [7]. Methods that determine biochemical reaction mechanisms from time series concentration data 
were reviewed in [8], including parameter estimation. In the same area, a more recent perspective 
(with a narrower scope) can be found in [9]. Techniques developed specifically for gene regulatory 
network inference were covered in [10] — which included an extensive overview of the different modeling 
formalisms — and in [11], as well as in other reviews that include also methods applicable to other 
types of networks [12,13]. Methods for the reconstruction of plant gene co-expression networks from 
transcriptomic data were reviewed in [14]. The survey [15] covers not only network inference but also 
other topics, although it does not discuss information theoretic methods. Recently, [16] studied the 
advantages and limitations of network inference methods, classifying them according to the strategies 
that they use to deal with underdetermination. Other reviews do not attempt to cover all the literature, 
but instead focus on a subset of methods on which they carry out detailed comparisons, such as [17-20]. 

The problem of network inference has been investigated in many different communities. The 
aforementioned reviews deal mostly with biological applications, and were published in journals of 
bioinformatics, systems biology, microbiology, molecular biology, physical chemistry, and control 
engineering communities. Many more papers on the subject are regularly published in journals from 
other areas. Systems identification, a part of systems and control theory, is a discipline in its own 
right, with a rich literature [21,22]. However, in contrast to biology, it deals mostly with engineered 
systems, and hence its approaches are frequently difficult to adapt or not appropriate for reverse 
engineering complex biological systems. Other research areas such as machine learning have produced 
many theoretically rigorous results about network inference, but their transfer to biological applications 
is not frequently carried out. In this survey we intend to give a broad overview of the literature 
from the different — although sometimes partially overlapping — communities that deal with the network 
inference problem with an information theoretic approach. Thus we review papers from the fields of 
statistics, machine learning, systems identification, chemistry, physics, and biology. We focus on those 
contributions that have been or are more likely to be applied to cellular networks. 

2. Background 

2.1. Correlations, Probabilities and Entropies 

Biological sciences have a long history of using statistical tools to measure the strength of dependence 
among variables. An early example is the correlation coefficient r [23,24], which quantifies the linear 
dependence between two random variables X and Y. It is commonly referred to as the Pearson 
correlation coefficient, and it is defined as the covariance of the two variables divided by the product 
of their standard deviations. For n samples it is: 

$$ r(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} \tag{1} $$

where $X_i$, $Y_i$ are the n data points, and $\bar{X}$, $\bar{Y}$ are their averages. If the two variables are linearly 
uncorrelated, r(X, Y) = 0 and knowledge of one of them provides no linear information about the other. In the 
opposite situation, where one variable is a linear function of the other, all the data points lie on a 
line and r(X, Y) = ±1. 

It should be noted that in this context the word "linear" may be used in two different ways. When 
applied to a deterministic system, it means that the differential equations that define the evolution of the 
system's variables in time are linear. On the other hand, when applied to the relationship between two 
variables, it means that the two-dimensional plot of their values (not of the variables as a function of time, 
X(t), Y(t), but of one variable as a function of the other, X(Y)) forms a straight line, independently of 
the character of the underlying system. 

A related concept, partial correlation, measures the dependence between two random variables X 
and Y after removing the effect of a third variable Z. It can be expressed in terms of the correlation 
coefficients as follows: 

$$ r(X, Y \mid Z) = \frac{r(X, Y) - r(X, Z)\, r(Y, Z)}{\sqrt{\left(1 - r^2(X, Z)\right)\left(1 - r^2(Y, Z)\right)}} \tag{2} $$

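
Equations (1) and (2) can be illustrated with a minimal sketch in plain Python. The function names and the toy data are assumptions made for illustration only: X and Y are both driven by Z, so their marginal correlation is close to -1, while the partial correlation given Z is much smaller in magnitude.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation coefficient, Equation (1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """First-order partial correlation r(X, Y | Z), Equation (2)."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Hypothetical toy data: X and Y both depend linearly on Z,
# plus small perturbations that are only weakly related to each other.
z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x = [2 * v + e for v, e in zip(z, [0.1, -0.1, 0.1, -0.1, 0.1, -0.1])]
y = [-3 * v + e for v, e in zip(z, [0.1, -0.1, -0.1, 0.1, 0.1, -0.1])]

print("r(X, Y)     =", pearson(x, y))
print("r(X, Y | Z) =", partial_corr(x, y, z))
```

A strong marginal correlation that largely vanishes after conditioning is the typical signature of an indirect interaction mediated by a third variable.
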
The Pearson coefficient is easy to calculate and symmetric, and its range of values has a clear 
interpretation. However, as noted in [25,26], it uses only the second moment of the pair distribution 
function (1), discarding all higher moments. For certain strong nonlinearities and correlations extending 
over several variables, moments higher than the second of the pair probability distribution function may 
contribute, and an alternative measure of dependence may be more appropriate. Hence the Pearson 
coefficient is not an accurate way of measuring nonlinear correlations, which are ubiquitous in biology. 
A more general measure is mutual information, a fundamental concept of information theory defined by 
Shannon [4]. To define it we must first introduce the concept of entropy, which is the uncertainty of a 
single random variable: let X be a discrete random vector with alphabet $\mathcal{X}$ and probability mass function 
p(x). The entropy is: 

$$ H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x) \tag{3} $$

where log is usually the logarithm to the base 2, although the natural logarithm may also be used. Entropy 
can be interpreted as the expected value of $\log \frac{1}{p(X)}$, that is 

$$ H(X) = E_p \log \frac{1}{p(X)} \tag{4} $$

The joint entropy of a pair of discrete random variables (X,Y) is 



$$ H(X, Y) = -\sum_{x} \sum_{y} p(x, y) \log p(x, y) = -E_p \log p(X, Y) \tag{5} $$



Cells 2013, 2 




Conditional entropy H(Y|X) is the entropy of a random variable conditional on the knowledge of 
another random variable. It is the expected value of the entropies of the conditional distributions, 
averaged over the conditioning random variable. For example, for two random variables X and Y 
we have 

$$ \begin{aligned} H(Y \mid X) &= \sum_{x} p(x) H(Y \mid X = x) = -\sum_{x} p(x) \sum_{y} p(y \mid x) \log p(y \mid x) \\ &= -\sum_{x} \sum_{y} p(x, y) \log p(y \mid x) = -E_{p(x,y)} \log p(Y \mid X) \end{aligned} \tag{6} $$

The joint entropy and the conditional entropy are related so that the entropy of a pair of random 
variables is the entropy of one plus the conditional entropy of the other: 

$$ H(X, Y) = H(X) + H(Y \mid X) \tag{7} $$

The relative entropy is a measure of the distance between two distributions with probability functions 
p(x) and q(x). It is defined as: 

$$ D(p \| q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} = E_p \log \frac{p(X)}{q(X)} \tag{8} $$

The relative entropy is always non-negative, and it is zero if and only if p = q. However, it is not a 
true distance because it is not symmetric and it does not satisfy the triangle inequality. 
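
The definitions in Equations (3) to (8) can be checked numerically with a short sketch. The helper names and the small joint distribution below are hypothetical, chosen only to verify the chain rule of Equation (7) and the asymmetry of the relative entropy.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a pmf given as {outcome: probability}, Equation (3)."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)

def marginal(pxy, axis):
    """Marginal pmf of one coordinate (0 for X, 1 for Y) of a joint pmf {(x, y): prob}."""
    out = {}
    for key, v in pxy.items():
        out[key[axis]] = out.get(key[axis], 0.0) + v
    return out

def conditional_entropy(pxy):
    """H(Y|X) = -sum_{x,y} p(x,y) log p(y|x), Equation (6)."""
    px = marginal(pxy, 0)
    return -sum(v * math.log2(v / px[x]) for (x, _), v in pxy.items() if v > 0)

def relative_entropy(p, q):
    """D(p||q), Equation (8); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(v * math.log2(v / q[k]) for k, v in p.items() if v > 0)

# Hypothetical joint distribution of two binary variables.
pxy = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.0, (1, 1): 0.25}
h_joint = entropy(pxy)                 # H(X, Y), Equation (5)
h_x = entropy(marginal(pxy, 0))        # H(X)
h_y_given_x = conditional_entropy(pxy)
print(h_joint, h_x + h_y_given_x)      # both sides of Equation (7)

p = {0: 0.5, 1: 0.5}
q = {0: 0.9, 1: 0.1}
print(relative_entropy(p, q), relative_entropy(q, p))  # not symmetric
```
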

2.2. Mutual Information 

Mutual information, I, is a special case of relative entropy: it is the relative entropy between the joint 
distribution, p(x, y), and the product distribution, p(x)p(y), that is: 

$$ I(X, Y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x) p(y)} = D(p(x, y) \| p(x) p(y)) = E_{p(x,y)} \log \frac{p(X, Y)}{p(X) p(Y)} \tag{9} $$

Linfoot [27] proposed the use of mutual information as a generalization of the correlation coefficient 
and introduced a normalization with values ranging from 0 to 1 : 

$$ I_L(X, Y) = \sqrt{1 - e^{-2 I(X, Y)}} \tag{10} $$

The mutual information is a measure of the amount of information that one random variable contains 
about another. It can also be defined as the reduction in the uncertainty of one variable due to the 
knowledge of another. Mutual information is related to entropy as follows: 

$$ I(X, Y) = H(X) - H(X \mid Y) = H(X) + H(Y) - H(X, Y) \tag{11} $$

Finally, the conditional mutual information measures the amount of information shared by two 
variables when a third variable is known: 



$$ I(X, Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z) \tag{12} $$

If Y and Z carry the same information about X, the conditional mutual information I(X, Y|Z) is zero. 
The relationship between entropy, joint entropy, conditional entropy, and mutual information is 
graphically depicted in Figure 1. Note that until now we have implicitly considered discrete variables; 
in the case of continuous variables the sums are replaced by integrals. For more detailed descriptions of these 
concepts, see [28]. 
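
As a rough illustration of Equations (10) to (12), the following sketch computes mutual information, Linfoot's normalization, and conditional mutual information from explicit probability tables. The function names and the toy distributions are assumptions made here; natural logarithms are used so that, for Gaussian variables, Equation (10) would reduce to the absolute value of the correlation coefficient.

```python
import math

def H(p):
    """Shannon entropy in nats of a pmf {tuple: probability}."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)

def marg(p, idx):
    """Marginalize a joint pmf {tuple: prob} onto the coordinates listed in idx."""
    out = {}
    for key, v in p.items():
        sub = tuple(key[i] for i in idx)
        out[sub] = out.get(sub, 0.0) + v
    return out

def mutual_information(pxy):
    """I(X,Y) = H(X) + H(Y) - H(X,Y), Equation (11)."""
    return H(marg(pxy, [0])) + H(marg(pxy, [1])) - H(pxy)

def linfoot(pxy):
    """sqrt(1 - exp(-2 I)), Equation (10); clipped at zero against rounding."""
    return math.sqrt(max(0.0, 1.0 - math.exp(-2.0 * mutual_information(pxy))))

def conditional_mi(pxyz):
    """I(X,Y|Z) = H(X|Z) - H(X|Y,Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return (H(marg(pxyz, [0, 2])) + H(marg(pxyz, [1, 2]))
            - H(pxyz) - H(marg(pxyz, [2])))

# Hypothetical examples: independent bits, identical bits, and three identical bits.
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
identical = {(0, 0): 0.5, (1, 1): 0.5}
all_equal = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}

print(mutual_information(independent), mutual_information(identical))
print(linfoot(identical))
print(conditional_mi(all_equal))
```

In the last example Y and Z carry the same information about X, so the conditional mutual information vanishes even though I(X, Y) does not.
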

Figure 1. Graphical representation of the entropies (H(X), H(Y)), joint entropy 
(H(X, Y)), conditional entropies (H(X|Y), H(Y|X)), and mutual information (I(X, Y)) 
of a pair of random variables (X, Y). 

Mutual information is a general measure of dependencies between variables. This suggests its 
application for evaluating similarities between datasets, which allows for inferring interaction networks 
of any kind: chemical, biological, social, or other. If two components of a network interact closely, 
their mutual information will be large; if they are not related, it will be theoretically zero. As already 
mentioned, mutual information is more general than the Pearson correlation coefficient, which is only 
rigorously applicable to linear correlations with Gaussian noise. Hence, mutual information may be able 
to detect additional non-linear correlations undetectable by the Pearson coefficient, as has been shown 
for example in [29], where it was demonstrated with metabolic data. 

In practice, for the purpose of network inference, mutual information cannot be analytically 
calculated, because the underlying network is unknown. Therefore, it must be estimated from 
experimental data, a task for which several algorithms of different complexity can be used. The most 
straightforward approximation is to use a "naive" algorithm that partitions the data into a number 
of bins of a fixed width, and approximates the probabilities by the frequencies of occurrence. This 
simple approach has the drawback that the mutual information is systematically overestimated [30]. 
A more sophisticated option uses adaptive partitioning, where the bin size in the partition depends 
on the density of data points. This is the case of the classic algorithm by Fraser and Swinney [31], 
which manages to improve the estimations although at the cost of increasing the computation times 
considerably. A more efficient version of this method was presented in [32], together with a comparison 
of alternative numerical algorithms. Another computationally demanding option is to use kernel density 
estimation for estimating the probability density p(x), which can then be applied to estimation of mutual 
information [33]. Recently, Hausser and Strimmer [34] presented a procedure for the effective estimation 
of entropy and mutual information from small sample data, and demonstrated its application to the 
inference of high-dimensional gene association networks. More details about the influence of the choice 
of estimators of mutual information on the network inference problem can be found in [35,36], including 
numerical comparisons between several methods. 
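
The "naive" fixed-width binning estimator described above can be sketched as follows. This is a minimal illustration, not the implementation used in any of the cited works; the function name, the number of bins, and the synthetic data are assumptions. Because the plug-in estimate is a relative entropy between empirical distributions, it cannot be negative, so for independent samples it is biased upwards.

```python
import math
import random
from collections import Counter

def binned_mi(x, y, bins=8):
    """Plug-in mutual information estimate (nats) using equal-width bins."""
    def digitize(v):
        lo, hi = min(v), max(v)
        width = (hi - lo) / bins or 1.0   # guard against constant data
        return [min(int((a - lo) / width), bins - 1) for a in v]

    bx, by = digitize(x), digitize(y)
    n = len(x)
    cxy, cx, cy = Counter(zip(bx, by)), Counter(bx), Counter(by)
    # sum over occupied cells of p(i,j) * log( p(i,j) / (p(i) p(j)) )
    return sum(c / n * math.log(c * n / (cx[i] * cy[j]))
               for (i, j), c in cxy.items())

random.seed(0)
x = [random.gauss(0, 1) for _ in range(200)]
y_indep = [random.gauss(0, 1) for _ in range(200)]   # independent of x, true MI is zero

mi_indep = binned_mi(x, y_indep)
mi_dep = binned_mi(x, x)      # fully dependent case: entropy of the binned variable
print(mi_indep, mi_dep)
```

The estimate for the independent pair is strictly positive although the true value is zero, which illustrates the systematic overestimation mentioned above.
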

Another issue related to estimation of mutual information is the determination of a threshold to 
distinguish interaction from non-interaction. One solution is given by the minimum description length 
(MDL) principle [37], which states that, given a dataset and several candidate models, one should choose 
the model that provides the shortest encoding of the data. The MDL principle seeks to achieve a good 
trade-off between model complexity and accuracy of data fitting. It is similar to other criteria for model 
selection, such as the popular Akaike (AIC) and Bayesian information criterion (BIC). Like the BIC, 
the MDL takes into account the sample size, and minimizes both the model coding length and the data 
coding length. 

We finish this section by mentioning that a discussion of some issues concerning the definition of 
multivariate dependence has been presented in [38]. The aim of the analysis was to clarify the concept of 
dependence among different variables, in order to be able to distinguish between independent (additive) 
and cooperative (multiplicative) regulation. 

2.3. Generalizations of Information Theory 

In the 1960s Marko proposed a generalization of Shannon's information theory called bidirectional 
information theory [39,40]. Its aim was to distinguish the direction of information flow, which was 
considered necessary to describe generation and processing of information by living beings. The concept 
of Directed Transinformation (DTI) was introduced as an extension of mutual information (which Shannon 
called transinformation). Let us consider two entities $M_1$ and $M_2$, with X being a symbol of $M_1$ and Y 
of $M_2$. Then the directed transinformation from $M_1$ to $M_2$ is 

$$ T_{12} = \lim_{n \to \infty} E\left[ -\log \frac{p(X \mid X_n)}{p(X \mid X_n Y_n)} \right] \tag{13} $$

where $p(X \mid X_n)$ represents the conditional probability for the occurrence of X when n previous symbols 
$X_n$ of the own process are known, and $p(X \mid X_n Y_n)$ is the conditional probability for the occurrence of 
X when n previous symbols $X_n$ of the own process as well as of the other process $Y_n$ are known. The 
directed transinformation from $M_2$ to $M_1$ is defined in the same way, replacing X with Y and vice versa. 
The sum of both transinformations equals Shannon's transinformation or mutual information, that is: 

$$ I = T_{12} + T_{21} \tag{14} $$

Marko's work was continued two decades later by Massey [41], who defined the directed 
information $I(X^N \to Y^N)$ from a sequence $X^N$ to a sequence $Y^N$ as a slight modification of the 
directed transinformation: 

$$ I(X^N \to Y^N) = \sum_{n=1}^{N} I(X^n; Y_n \mid Y^{n-1}) \tag{15} $$






If no feedback between Y and X is present, then the directed information and the traditional mutual 
information are equal, $I(X^N \to Y^N) = I(X^N; Y^N)$. 

Another generalization of Shannon entropy is the concept of nonextensive entropy. Shannon entropy 
(also called Boltzmann-Gibbs entropy, which we denote here as $H_{BG}$) agrees with standard statistical 
mechanics, a theory that applies to a large class of physical systems: those for which ergodicity 
is satisfied at the microscopic dynamical level. Standard statistical mechanics is extensive, that is, 
it assumes that, for a system S consisting of N independent subsystems $S_1, \ldots, S_N$, it holds that 
$H_{BG}(S) = \sum_{i=1}^{N} H_{BG}(S_i)$. This property is a result of the short-range nature of the interactions typically 
considered (think, for example, of the entropy of two subsets of an ideal gas). However, there are many 
systems where long-range interactions exist, and thus violate this hypothesis — a fact not always made 
explicit in the literature. To overcome this limitation, in 1988 Constantino Tsallis [42] proposed the 
following generalization of the Boltzmann-Gibbs entropy: 

$$ H_q(X) = -k \frac{1 - \sum_{i=1}^{w} p_i^q}{1 - q} \tag{16} $$

where k is a positive constant that sets the dimension and scale, $p_i$ are the probabilities associated with 
the w distinct configurations of the system, and $q \in \mathbb{R}$ is the so-called entropic parameter, which 
characterizes the generalization. The entropic parameter characterizes the degree of nonextensivity; 
in the limit $q \to 1$ one recovers $H_{q=1} = -k \sum_{i} p_i \log p_i$, with $k = k_B$, the Boltzmann constant. 
The generalized entropy $H_q$ is the basis of what has been called non-extensive statistical mechanics, as 
opposed to the standard statistical mechanics based on $H_{BG}$. Indeed, $H_q$ is non-extensive for systems 
without correlations; however, for complex systems with long-range correlations the reverse is true: $H_{BG}$ 
is non-extensive and is not an appropriate entropy measure, while $H_q$ becomes extensive [43]. It has been 
suggested that the degree of nonextensivity can be used as a measure of complexity [44]. Scale-free 
networks [45,46] are an example of systems for which $H_q$ is extensive and $H_{BG}$ is not. Scale-free 
networks are characterized by the fact that their vertex connectivities follow a scale-free power-law 
distribution. It has been recognized that many complex systems from different areas — technological, 
social, and biological — are of this type. For these systems, it has been suggested that it is more 
meaningful to define the entropy in the form of Equation (16) instead of Equation (3). By defining 
the q-logarithm function as $\ln_q(x) = \frac{x^{1-q} - 1}{1 - q}$, the nonextensive entropy can be expressed (taking k = 1) in a similar 
form as the Boltzmann-Gibbs entropy, Equation (3): 

$$ H_q(X) = -\sum_{x} p(x)^q \ln_q p(x) = \sum_{x} p(x) \ln_q \frac{1}{p(x)} \tag{17} $$

and analogously one can define nonextensive versions of conditional entropy or mutual information. 

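
The nonextensive entropy can be explored with a small numerical sketch. The function names and test distributions are hypothetical. The code checks that the two expressions for $H_q$ agree for k = 1, that the Shannon entropy is recovered as q approaches 1, and that for independent subsystems $H_q$ is not additive. The last check uses the standard pseudo-additivity relation $H_q(A, B) = H_q(A) + H_q(B) + (1 - q) H_q(A) H_q(B)$, which is not stated in the text above.

```python
import math

def ln_q(x, q):
    """q-logarithm; reduces to the natural logarithm as q -> 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def tsallis_entropy(p, q, k=1.0):
    """H_q = -k (1 - sum_i p_i^q) / (1 - q); Shannon entropy (nats) when q = 1."""
    if abs(q - 1.0) < 1e-12:
        return -k * sum(v * math.log(v) for v in p if v > 0)
    return -k * (1.0 - sum(v ** q for v in p if v > 0)) / (1.0 - q)

def tsallis_entropy_qlog(p, q):
    """Same quantity for k = 1, written as -sum_x p(x)^q ln_q p(x)."""
    return -sum(v ** q * ln_q(v, q) for v in p if v > 0)

p = [0.5, 0.25, 0.25]
print(tsallis_entropy(p, 2.0), tsallis_entropy_qlog(p, 2.0))
print(tsallis_entropy(p, 1.0001), tsallis_entropy(p, 1.0))

# Two independent fair coins and their joint distribution.
pa = [0.5, 0.5]
pb = [0.5, 0.5]
pab = [a * b for a in pa for b in pb]
q = 2.0
h_a, h_b, h_ab = (tsallis_entropy(d, q) for d in (pa, pb, pab))
print(h_ab, h_a + h_b, h_a + h_b + (1 - q) * h_a * h_b)
```
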
3. Review of Network Inference Methods 

3.1. Detecting Interactions: Correlations and Mutual Information 

Early examples of techniques based on mutual information in a biological context can be found 
in [47], where it was used to determine eukaryotic protein coding regions, and [48], where it was applied 
for analyzing covariation of mutations in the V3 loop of the HIV-1 envelope protein. Since then many 
more examples have followed, with the first applications in network inference appearing in the second 
half of the 1990s. Specifically, in the 1998 Pacific Symposium on Biocomputing two methods for reverse 
engineering gene networks based on mutual information were presented. The REVEAL [49] algorithm 
used Boolean (on/off) models of gene networks and inferred interactions from mutual information. It 
was implemented in C and tested on synthetic data, with good results reported for a network of 50 
elements and 3 inputs per element. In another contribution from the same symposium [50] mutual 
information was normalized as $I_{Norm}(X, Y) = \frac{I(X, Y)}{\max(H(X), H(Y))}$, and a distance matrix was then defined 
as $d_M(X, Y) = 1 - I_{Norm}(X, Y)$. The distance matrix was used to find correlated patterns of gene 
expression from time series data. The normalization presents two advantages: the value of the distance 
d is between 0 and 1, and $d(X_i, X_i) = 0$. 

Two years later, Butte et al. [51] proposed a technique for finding functional genomic clusters in 
RNA expression data, called mutual information relevance networks. Pair-wise mutual information 
between genes was calculated as in Equation (11), and it was hypothesized that associations with high 
mutual information were biologically related. Simultaneously, the same group published a related 
method [52] that used the correlation coefficient r, Equation (1), instead of mutual information. The 
method, known as relevance networks (RN), was used to discover functional relationships between RNA 
expression and chemotherapeutic susceptibility. In this work the similarity of patterns of features was 
rated using pair-wise correlation coefficients defined as $\hat{r}^2 = \frac{r}{|r|} r^2$. Butte et al. mentioned a number of 
advantages of their method over previous ones. First, relevance networks are able to display nodes with 
varying degrees of cross-connectivity, while phylogenetic-type trees such as the aforementioned [50] 
can only link each feature to one other feature, without additional links. Second, phylogenetic-type 
trees cannot easily cluster different types of biological data. For example, they can cluster genes and 
anticancer agents separately, but do not easily determine associations between genes and anticancer 
agents. Third, clustering methods such as [50] may ignore genes whose expression levels are highly 
negatively correlated across cell lines; in contrast, in RN negative and positive correlations are treated in 
the same way and are used in clustering. 
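
The idea behind relevance networks can be sketched as follows, assuming the signed squared coefficient $(r/|r|)\,r^2$ as the similarity score. The function names, the feature names, the data, and the threshold value are all hypothetical. This is an illustration of the thresholding step only, not a reproduction of the procedure in [52].

```python
import math
from itertools import combinations

def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def relevance_network(profiles, threshold):
    """Return edges (a, b, score) whose signed r^2 has magnitude >= threshold.

    profiles: {feature name: list of measurements}; threshold is a free parameter
    of this sketch.
    """
    edges = []
    for a, b in combinations(sorted(profiles), 2):
        r = pearson(profiles[a], profiles[b])
        score = math.copysign(r * r, r)        # (r / |r|) * r^2 keeps the sign of r
        if abs(score) >= threshold:
            edges.append((a, b, score))
    return edges

# Hypothetical measurements of four features across five samples.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "g2": [2.0, 4.0, 6.0, 8.0, 10.5],    # positively correlated with g1
    "g3": [5.0, 4.0, 3.0, 2.0, 1.0],     # negatively correlated with g1
    "g4": [1.0, -1.0, 0.0, -1.0, 1.0],   # essentially unrelated
}
edges = relevance_network(profiles, threshold=0.8)
for a, b, score in edges:
    print(a, b, round(score, 3))
```

Positive and negative associations are retained alike, since only the magnitude of the score is compared with the threshold.
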

Pearson's correlation coefficient was also used in [53] to assemble a gene coexpression network, 
with the ultimate goal of finding genetic modules that are conserved across evolution. DNA microarray 
data from humans, flies, worms, and yeast were used, and 22,163 coexpression relationships were 
found. The predictions implied by some of the discovered links were experimentally confirmed, and 
cell proliferation functions were identified for several genes. 

In [54] transcriptional gene networks in human and mouse were reverse-engineered using a simple 
mutual information approach, where the expression values were discretized into three bins. The 
relevance of this study is due to the massive datasets used: 20,255 gene expression profiles from human 
samples, from which 4,817,629 connections were inferred. Furthermore, a subset of not previously 
described protein-protein interactions was experimentally validated. For a discussion on the use of 
information theory to detect protein-protein interactions, see Section 3.1 of [55]. 

The aforementioned methods were developed mostly for gene expression data. In contrast, the next 
two techniques, Correlation Metric Construction (CMC) and Entropy Metric Construction (EMC), aimed 
at reverse engineering chemical reaction mechanisms, and used time series data (typically metabolic) of 
the concentration of the species present in the mechanism. In CMC [56] the time-lagged correlations 
between two species are calculated as $S_{ij}(\tau) = \langle (x_i(t) - \bar{x}_i)(x_j(t + \tau) - \bar{x}_j) \rangle$, where $\langle \cdot \rangle$ denotes 
the time average over all measurements, and $\bar{x}_i$ is the time average of the concentration time 
series of species i. From these functions a correlation matrix $R(\tau)$ is calculated; its elements are 
$r_{ij}(\tau) = S_{ij}(\tau) / \sqrt{S_{ii}(\tau) S_{jj}(\tau)}$. Then the elements $d_{ij}^{CMC}$ of the distance matrix are obtained as 
$d_{ij}^{CMC} = (c_{ii} - 2 c_{ij} + c_{jj})^{1/2} = \sqrt{2 (1 - c_{ij})}$, where $c_{ij} = \max_{\tau} |r_{ij}(\tau)|$. Finally, Multidimensional 
Scaling (MDS) is applied to the distance matrix, yielding a configuration of points representing each 
of the species, which are connected by lines that are estimates for the connectivities of the species in 
the reactions. Furthermore, the temporal ordering of the correlation maxima provides an indication of 
the causality of the reactions. CMC was first tested on a simulated chemical reaction mechanism [56], 
and was later successfully applied to the reconstruction of the glycolytic pathway from experimental 
data [57]. More recently, it has been integrated in a systematic model building pipeline [58], which 
includes not only inference of the chemical network, but also data preprocessing, automatic model family 
generation, model selection and statistical analysis. 
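
The first steps of CMC (time-lagged correlation, maximization over the lag, and conversion to a distance) can be sketched as below. This is a simplified illustration under stated assumptions: lags are integer numbers of samples, the correlation is normalized by the zero-lag variances so that it stays within [-1, 1], and the MDS step is omitted. The function names and the synthetic series are hypothetical.

```python
import math
import random

def lagged_corr(xi, xj, lag):
    """Correlation between x_i(t) and x_j(t + lag) for an integer lag >= 0.

    Assumption of this sketch: normalization by the zero-lag sums of squares.
    """
    n = len(xi)
    mi, mj = sum(xi) / n, sum(xj) / n
    s_ij = sum((xi[t] - mi) * (xj[t + lag] - mj) for t in range(n - lag))
    s_ii = sum((a - mi) ** 2 for a in xi)
    s_jj = sum((b - mj) ** 2 for b in xj)
    return s_ij / math.sqrt(s_ii * s_jj)

def cmc_distance(xi, xj, max_lag):
    """Return (d_ij, best_lag) with d_ij = sqrt(2 (1 - c_ij)), c_ij = max_lag |r_ij(lag)|.

    A positive best_lag means that x_j follows x_i.
    """
    best_c, best_lag = 0.0, 0
    for lag in range(-max_lag, max_lag + 1):
        r = lagged_corr(xi, xj, lag) if lag >= 0 else lagged_corr(xj, xi, -lag)
        if abs(r) > best_c:
            best_c, best_lag = abs(r), lag
    return math.sqrt(max(0.0, 2.0 * (1.0 - best_c))), best_lag

# Synthetic series: species j reproduces species i with a delay of 3 samples.
random.seed(1)
base = [random.gauss(0, 1) for _ in range(303)]
xi = base[3:]       # x_i(t)
xj = base[:300]     # x_j(t) = x_i(t - 3)

d, lag = cmc_distance(xi, xj, max_lag=8)
print(d, lag)
```

The sign of the maximizing lag gives the temporal ordering that CMC uses as an indication of causality.
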

The Entropy Metric Construction method, EMC [25,26], is a modification of CMC that replaces the 
correlation measures with entropy-based distances, $e^{H(X, Y) - H(X) - H(Y)} = e^{-I(X, Y)}$. The 
EMC correlation distance is the minimum regardless of $\tau$: $d_{ij}^{EMC} = \min_{\tau} e^{-I(X, Y)}$. If the correlation is 
Gaussian, usually $d_{ij}^{EMC} \approx d_{ij}^{CMC}$. Originally, EMC was applied to an artificial reaction mechanism with 
pseudo-experimental data, for which it was reported to outperform CMC. Recently [59] it has been tested with 
the same glycolytic pathway reconstructed by CMC [57], with both methods yielding similar results. 

Recently, CMC/EMC has inspired a method [60] that combines network inference by time-lagged 
correlation and estimation of kinetic parameters with a maximum likelihood approach. It was applied 
to a test case from pharmacokinetics: the deduction of the metabolic pathway of gemcitabine, using 
synthetic and experimental data. 

The empirical distance correlation (DCOR) was presented in [61,62]. Given a random sample of n 
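random vectors (X, Y), Euclidean distance matrices are calculated as $a_{kl} = |X_k - X_l|$, $b_{kl} = |Y_k - Y_l|$. 
Define $\bar{a}_{k\cdot} = \frac{1}{n} \sum_{l=1}^{n} a_{kl}$, $\bar{a}_{\cdot l} = \frac{1}{n} \sum_{k=1}^{n} a_{kl}$, $\bar{a}_{\cdot\cdot} = \frac{1}{n^2} \sum_{k,l=1}^{n} a_{kl}$, $A_{kl} = a_{kl} - \bar{a}_{k\cdot} - \bar{a}_{\cdot l} + \bar{a}_{\cdot\cdot}$, and 
similarly for $B_{kl}$. Then the empirical distance covariance $\nu_n(X, Y)$ is the nonnegative quantity defined by 

$$ \nu_n^2(X, Y) = \frac{1}{n^2} \sum_{k,l=1}^{n} A_{kl} B_{kl} \tag{18} $$

Similarly, $\nu_n^2(X) = \nu_n^2(X, X) = \frac{1}{n^2} \sum_{k,l=1}^{n} A_{kl}^2$, and the distance correlation DCOR = $R_n(X, Y)$ is 
the square root of 

$$ R_n^2(X, Y) = \begin{cases} \dfrac{\nu_n^2(X, Y)}{\sqrt{\nu_n^2(X)\, \nu_n^2(Y)}}, & \nu_n^2(X)\, \nu_n^2(Y) > 0 \\ 0, & \nu_n^2(X)\, \nu_n^2(Y) = 0 \end{cases} \tag{19} $$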

Unlike the classical definition of correlation, distance correlation is zero only if the random vectors 
are independent. Furthermore, it is defined for X and Y in arbitrary dimensions, rather than only for 
univariate quantities. DCOR is a good example of a method that has gained recognition inside a 
research community (statistics) but whose merits have hardly become known to scientists working on 
other areas (such as the applied biological sciences). Some exceptions have recently appeared. 
In [63] it was used for the detection of long-range concerted motion in proteins. In a study concerning 
mortality [64], significant distance correlations were found between death ages, lifestyle factors, 
and family relationships. As for applications in network inference, [65] compared eight statistical 
measures, including distance covariance, evaluating their performance in gene association problems (the 
other measures being Spearman rank correlation, Weighted Rank Correlation, Kendall, Hoeffding's D 
measure, Theil-Sen, Rank Theil-Sen, and Pearson). Interestingly, the least efficient methods turned out 
to be Pearson and distance covariance. 
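
The empirical distance correlation defined above can be computed directly from its definition. The sketch below is restricted to univariate samples for brevity and uses O(n^2) memory. The function names and data are hypothetical. It contrasts DCOR with the Pearson coefficient on a purely quadratic relationship.

```python
import math

def _centered(v):
    """Double-centered distance matrix A_kl = a_kl - a_k. - a_.l + a_.. for a 1-D sample."""
    n = len(v)
    d = [[abs(a - b) for b in v] for a in v]
    row = [sum(r) / n for r in d]          # row means (equal to column means by symmetry)
    grand = sum(row) / n
    return [[d[k][l] - row[k] - row[l] + grand for l in range(n)] for k in range(n)]

def dcov2(x, y):
    """Squared empirical distance covariance."""
    A, B = _centered(x), _centered(y)
    n = len(x)
    return sum(A[k][l] * B[k][l] for k in range(n) for l in range(n)) / n ** 2

def dcor(x, y):
    """Empirical distance correlation R_n(X, Y)."""
    vx, vy = dcov2(x, x), dcov2(y, y)
    if vx * vy <= 0:
        return 0.0
    return math.sqrt(max(0.0, dcov2(x, y)) / math.sqrt(vx * vy))

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y_quadratic = [a * a for a in x]           # nonlinear, symmetric dependence
y_linear = [2 * a + 1 for a in x]

print(pearson(x, y_quadratic), dcor(x, y_quadratic))
print(dcor(x, y_linear))
```

For the quadratic relationship the Pearson coefficient vanishes by symmetry, whereas the distance correlation is clearly positive.
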

The Maximal Information Coefficient (MIC) is another recently proposed measure of association 
between variables [66]. It was designed with the goal of assigning similar values to equally noisy 
relationships, independently of the type of association, a property termed "equitability". The main idea 
behind MIC is that if two variables (X, Y) are related, their relationship can be encapsulated by a grid 
that partitions the data in the scatter plot. Thus, all possible grids are explored (up to a maximal resolution 
that depends on the sample size) and for each m-by-n grid the largest possible mutual information 
I(X, Y) is computed. Then the mutual information values are normalized between 0 and 1, ensuring 
a fair comparison between grids of different dimensions. The MIC measure is defined as the maximum 
of the normalized mutual information values [67]: 

$$ MIC(X, Y) = \max_{|X| |Y| < B} \frac{I(X, Y)}{\log(\min(|X|, |Y|))} \tag{20} $$

where |X| and |Y| are the number of bins for each variable and B is the maximal resolution. This 
methodology has been applied to data sets in global health, gene expression, major-league baseball, and 
the human gut microbiota [66], demonstrating its ability to identify known and novel relationships. 

The claims about MIC's performance expressed in the original publication [66] have generated some 
criticism. In a comment posted on the publication web site, Simon and Tibshirani pointed out that, 
since there is "no free lunch" in statistics, tests designed to have high power against all alternatives 
have low power in many important situations. Hence, the fact that MIC has no preference for some 
alternatives over others (equitability) can be counterproductive in many cases. Simon and Tibshirani 
reported simulation results showing that MIC has lower power than DCOR for most relationships, and 
that in some cases MIC is less powerful than Pearson correlation as well. These deficiencies would 
indicate that MIC will produce many false positives in large scale problems, and that the use of the 
distance correlation measure is more advisable. In a similar comment, Gorfine et al. opposed the claim 
that non-equitable methods are less practical for data exploration, arguing that both DCOR and their own 
HHG method [68] are more powerful than the test based on MIC. At the moment of writing this article, 
the debate about the concept of equitability and its relation to mutual information and the MIC is very 
active at the arXiv website, with opposite views such as the ones expressed in [67,69]. 

Recently, the nonextensive entropy proposed by Tsallis has also been used in the context of 
reverse-engineering gene networks [70]. Given some temporal data, the method fixes a gene target 
$X_i$ and looks for the group of genes g that minimizes the nonextensive conditional entropy for a fixed q: 

(21) 

The reported results show an improvement in inference accuracy when adopting nonextensive 
entropies instead of traditional entropies. The best computational results in terms of reduction of the 
number of false positives were obtained with the range of values 2.5 < q < 3.5, which corresponds to 
subextensive entropy. This claim stresses the importance of the additional tuning parameter, q, allowed 
by the Tsallis entropy. The fact that q has to be fixed a priori is a drawback for its use in reverse 
engineering applications, since it is unclear how to choose its value. 
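
To make the role of q more concrete, the following minimal sketch evaluates the Tsallis entropy in its standard form, S_q = (1 - sum_i p_i^q)/(q - 1), for several values of q. The function name `tsallis_entropy` and the example distribution are illustrative assumptions and are not taken from [70]; the conditional version used in that work is not reproduced here.

```python
import math

def tsallis_entropy(probs, q):
    """Tsallis entropy S_q = (1 - sum_i p_i**q) / (q - 1).

    In the limit q -> 1 it reduces to the Shannon entropy (in nats).
    """
    probs = [p for p in probs if p > 0]
    if abs(q - 1.0) < 1e-12:
        return -sum(p * math.log(p) for p in probs)
    return (1.0 - sum(p ** q for p in probs)) / (q - 1.0)

# Illustrative distribution (an assumption, not data from [70]).
example = [0.5, 0.25, 0.125, 0.125]
for q in (0.5, 1.0, 2.5, 3.5):
    print(f"q = {q}: S_q = {tsallis_entropy(example, q):.4f}")
```

For a fixed distribution the value changes markedly with q, which illustrates why an a priori choice of q can influence which candidate gene group minimizes the entropy.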

Finally, we discuss some methods that use the minimum description length principle (MDL) described 
in Subsection 2.2. MDL was applied in [71] to infer gene regulatory networks from time series data, 
reporting good results with both synthetic datasets and experimental data from Drosophila melanogaster. 
While that method eliminated the need for a user-defined threshold value, it introduced the need for 
a user-defined tuning parameter to balance the contributions of model and data coding lengths. To 
overcome this drawback, in [72] it was proposed to use as the description length a theoretical measure 
derived from a maximum likelihood model. This alternative was reported to improve the accuracy of 
reconstructions of Boolean networks. The same goal was pursued in [73], where a network inference 
method that included a predictive MDL criterion was presented. This approach incorporated not only 
mutual information, but also conditional mutual information. 

3.2. Distinguishing between Direct and Indirect Interactions 

A number of methods have been proposed that use information theoretic considerations to distinguish 
between direct and indirect interactions. The underlying idea is to establish whether the variation in a 
variable can be explained by the variations in a subset of other variables in the system. 

The Entropy Reduction Technique, ERT [25,26], is an extension of EMC that outputs the list of 
species $\mathbf{X}^*$ with which a given species Y reacts, in order of the reaction strength. The mathematical 
formulation stems from the observation that, if a variable Y is completely independent of a set of 
variables X, then H(Y|X) = H(Y); otherwise H(Y|X) < H(Y). The ERT algorithm is defined as 
follows [25], where $\mathbf{X}^*$ denotes the set of species selected so far and $X^*$ the species chosen in 
the current iteration: 

1. Given a species Y, start with $\mathbf{X}^* = \emptyset$ 

2. Find $X^*$ such that $H(Y|\mathbf{X}^*, X^*) = \min_X H(Y|\mathbf{X}^*, X)$ 

3. Set $\mathbf{X}^* = \{\mathbf{X}^*, X^*\}$ 

4. Stop if $H(Y|\mathbf{X}^*, X^*) = H(Y|\mathbf{X}^*)$, or when all species except Y are already in $\mathbf{X}^*$; otherwise go 
to step 2 

Intuitively, the method determines whether the nonlinear variation in a variable Y, as given by its 
entropy, is explainable by the variations of a subset (possibly all) of the other variables in the system, 
$\mathbf{X}^*$. It does so by iterating through cycles of adding to the set $\mathbf{X}^*$ the variable $X^*$ that minimizes 
$H(Y|\mathbf{X}^*, X^*)$ until further additions do not decrease the entropy. This technique leads to an ordered set of variables that 
control the variation in Y. A methodology called MIDER (Mutual Information Distance and Entropy 
Reduction), which combines and extends features of the ERT and EMC techniques, has been recently 
developed and a MATLAB implementation is available as a free software toolbox [59]. 
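
The greedy selection described above can be sketched as follows for discrete data. This is a minimal illustration under stated assumptions, not the implementation of [25] or of MIDER: entropies are plug-in estimates from counts, the exact-equality stopping rule is replaced by a small tolerance `tol`, the stopping test is applied before a candidate is added, and all names (`joint_entropy`, `cond_entropy`, `ert`) and the toy data are hypothetical.

```python
import math
from collections import Counter

def joint_entropy(columns):
    """Plug-in joint entropy (bits) of a list of equally long discrete columns."""
    n = len(columns[0])
    counts = Counter(zip(*columns))
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def cond_entropy(y, given):
    """H(Y | given) = H(Y, given) - H(given); equals H(Y) when `given` is empty."""
    if not given:
        return joint_entropy([y])
    return joint_entropy([y] + given) - joint_entropy(given)

def ert(data, target, tol=1e-9):
    """Return the species that successively reduce H(target | selected), in order."""
    y = data[target]
    selected = []                      # the selected set, kept in selection order
    candidates = [name for name in data if name != target]
    current = cond_entropy(y, [])      # H(Y) before any species is selected
    while candidates:
        chosen = [data[name] for name in selected]
        # Candidate X minimizing H(Y | selected, X)
        best = min(candidates, key=lambda name: cond_entropy(y, chosen + [data[name]]))
        h_best = cond_entropy(y, chosen + [data[best]])
        # Stop when the best candidate no longer reduces the entropy
        if current - h_best <= tol:
            break
        selected.append(best)
        candidates.remove(best)
        current = h_best
    return selected

# Toy data (an assumption): y depends on a and b only; c is unrelated.
data = {
    "a": [0, 0, 0, 0, 1, 1, 1, 1],
    "b": [0, 0, 1, 1, 0, 0, 1, 1],
    "c": [0, 1, 0, 1, 0, 1, 0, 1],
}
data["y"] = [a | b for a, b in zip(data["a"], data["b"])]
print(ert(data, "y"))
```

With finite samples the plug-in conditional entropy almost never stays exactly constant, which is why a tolerance is used here in place of the equality test in step 4.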

The ARACNE method [74-76] is an information-theoretic algorithm for identifying transcriptional 
interactions between gene products, using microarray expression profile data. It consists of two steps. 
In the first step, the mutual information between pairs of genes is calculated as in Equation (11), and 
pairs that have a mutual information greater than a threshold I_0 are identified as candidate interactions. 
This part is similar to the method of mutual information relevance networks [51]. In the second step, the 
Data Processing Inequality (DPI) is applied to discard indirect interactions. The DPI is a well-known 
property of mutual information [28] which states that, if X → Y → Z forms a Markov chain, then 
I(X, Y) ≥ I(X, Z). The ARACNE algorithm examines each gene triplet for which all three MIs are 
greater than I_0 and removes the edge with the smallest value. In this way, ARACNE manages to reduce 
the number of false positives, which is a limitation of mutual information relevance networks. Indeed, 
when tested on synthetic data, ARACNE outperformed relevance networks and Bayesian networks. 
ARACNE has also been applied to experimental data, with the first application being reverse engineering 
of regulatory networks in human B cells [74]. If time-course data is available, a version of ARACNE 
that considers time delays [77] can be used. 
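
The two steps can be illustrated with a short sketch that operates on a precomputed table of pairwise mutual information values. It is a simplified illustration rather than the ARACNE implementation of [74-76]: the function name `aracne_prune`, the dictionary layout, and the numerical values are assumptions, and edges are marked for removal across all triplets before any is deleted, so that the result does not depend on the order in which triplets are visited.

```python
from itertools import combinations

def aracne_prune(mi, threshold):
    """Threshold pairwise MI values, then drop the weakest edge of every closed triplet.

    mi maps frozenset({gene_a, gene_b}) to a mutual information value.
    """
    # Step 1: keep candidate interactions above the threshold I_0
    edges = {pair: value for pair, value in mi.items() if value > threshold}
    nodes = sorted({gene for pair in edges for gene in pair})
    # Step 2: apply the DPI to every triplet whose three edges all survived step 1
    to_remove = set()
    for a, b, c in combinations(nodes, 3):
        triplet = [frozenset((a, b)), frozenset((a, c)), frozenset((b, c))]
        if all(pair in edges for pair in triplet):
            to_remove.add(min(triplet, key=lambda pair: edges[pair]))
    return {pair: value for pair, value in edges.items() if pair not in to_remove}

# Hypothetical MI values for a chain X -> Y -> Z plus a weak pair (Z, W).
mi = {
    frozenset(("X", "Y")): 0.8,
    frozenset(("Y", "Z")): 0.7,
    frozenset(("X", "Z")): 0.3,
    frozenset(("Z", "W")): 0.05,
}
network = aracne_prune(mi, threshold=0.1)
print(sorted(tuple(sorted(pair)) for pair in network))
```

In this example the pair (Z, W) is discarded by the threshold, and the edge between X and Z is discarded as the weakest of the triplet, consistent with it being an indirect interaction mediated by Y.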

The definition of Conditional Mutual Information (12) clearly suggests its application for 
distinguishing between direct and indirect interactions. This is the idea underlying the method proposed 
in [78], which was tested on artificial and real (melanoma) datasets. The so-called direct connectivity 
metric (DCM) was introduced as a measure of the confidence in the prediction that two genes X and Y 
were connected. The DCM is defined as the following product: 

$$\mathrm{DCM} = I(X, Y) \cdot \min_{Z \in V \setminus \{X, Y\}} I(X, Y|Z) \qquad (22)$$

where $\min_{Z \in V \setminus \{X, Y\}} I(X, Y|Z)$ is the least conditional mutual information given any other gene Z. This 
method was compared with ARACNE and mutual information relevance networks [51] and was reported 
to outperform them for certain datasets. 
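
As an illustration of Equation (22), the sketch below computes the DCM from discrete samples using plug-in entropy estimates. It is not the code of [78]; the helper names and the toy data are assumptions. The data are constructed so that the empirical distribution is exactly a Markov chain a → b → c, in which b copies a, and c copies b, with probability 3/4.

```python
import math
from collections import Counter

def entropy(*columns):
    """Plug-in joint entropy (bits) of one or more discrete columns."""
    n = len(columns[0])
    counts = Counter(zip(*columns))
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def mutual_info(x, y):
    return entropy(x) + entropy(y) - entropy(x, y)

def cond_mutual_info(x, y, z):
    # I(X,Y|Z) = H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z)
    return entropy(x, z) + entropy(y, z) - entropy(z) - entropy(x, y, z)

def dcm(data, x, y):
    """I(X,Y) multiplied by the least I(X,Y|Z) over every other variable Z."""
    others = [name for name in data if name not in (x, y)]
    least_cmi = min(cond_mutual_info(data[x], data[y], data[z]) for z in others)
    return mutual_info(data[x], data[y]) * least_cmi

# Joint counts of (a, b, c) for the hypothetical chain a -> b -> c.
counts = {(0, 0, 0): 9, (0, 0, 1): 3, (0, 1, 1): 3, (0, 1, 0): 1,
          (1, 1, 1): 9, (1, 1, 0): 3, (1, 0, 0): 3, (1, 0, 1): 1}
rows = [state for state, k in counts.items() for _ in range(k)]
data = {name: [row[i] for row in rows] for i, name in enumerate("abc")}

print("DCM(a, b) =", dcm(data, "a", "b"))
print("DCM(a, c) =", dcm(data, "a", "c"))
```

Although a and c share some mutual information, conditioning on b removes it, so the score of the indirect pair (a, c) vanishes up to rounding while that of the direct pair (a, b) stays positive.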

The Context Likelihood of Relatedness technique, CLR [79], adds a correction step to the calculation 
of mutual information, comparing the value of the mutual information between a transcription factor 
X and a gene Y with the background distribution of mutual information for all possible interactions 
involving X or Y. In this way the network context of the interactions is taken into account. The 
main idea behind CLR is that the most probable interactions are not necessarily those with the highest 
MI scores, but those whose scores are significantly above the background distribution; the additional 
correction step helps to remove false correlations. CLR was validated [79] using E. coli data and known 
regulatory interactions from RegulonDB, and compared with other methods: relevance networks [52], 
ARACNe [74], and Bayesian networks [80]. It was reported [79] that CLR demonstrated a precision 
gain of 36% relative to the next best performing algorithm. In [81] CLR was compared with a 
module-based algorithm, LeMoNe (Learning Module Networks), using expression data and databases 
of known transcriptional regulatory interactions for E. coli and S. cerevisiae. It was concluded that 
module-based and direct methods retrieve distinct parts of the networks. 
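
The background correction can be sketched as follows. The formula used here is an assumption, since it is not given in the text above: each MI value is converted into a z-score against the MI values of each of the two genes involved, negative z-scores are set to zero, and the two are combined as the square root of the sum of their squares, which is a commonly described form of the CLR score. The function name `clr_scores` and the numerical values are hypothetical.

```python
import math
from statistics import mean, pstdev

def clr_scores(mi):
    """mi is a symmetric dict of dicts: mi[a][b] is the MI between genes a and b."""
    names = list(mi)
    background = {}
    for a in names:
        values = [mi[a][b] for b in names if b != a]
        background[a] = (mean(values), pstdev(values))

    def z(a, b):
        # z-score of MI(a, b) against the background of gene a, floored at zero
        mu, sigma = background[a]
        return max(0.0, (mi[a][b] - mu) / sigma) if sigma > 0 else 0.0

    return {(a, b): math.hypot(z(a, b), z(b, a))
            for i, a in enumerate(names) for b in names[i + 1:]}

# Hypothetical MI values: h has the same MI (0.5) with every other gene.
pairs = {("h", "g1"): 0.5, ("h", "g2"): 0.5, ("h", "g3"): 0.5,
         ("g1", "g2"): 0.45, ("g1", "g3"): 0.05, ("g2", "g3"): 0.05}
mi = {name: {} for name in ("h", "g1", "g2", "g3")}
for (a, b), value in pairs.items():
    mi[a][b] = value
    mi[b][a] = value

scores = clr_scores(mi)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

In this toy example the pairs (h, g1) and (h, g3) have the same raw MI but receive different scores, because 0.5 stands out more against the background of g3 than against that of g1.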

The Minimum Redundancy Networks technique (MRNET [82]) was developed for inferring genetic 
networks from microarray data. It is based on a previous method for feature selection in supervised 
learning called maximum relevance/minimum redundancy (MRMR [83-85]). Given an output variable 
Y and a set of possible input variables X, MRMR ranks the inputs according to a score that is the 
difference between the mutual information with the output variable Y (maximum relevance) and the 
average mutual information with the previously ranked variables (minimum redundancy). By doing this, 
MRMR is intended to select, from the least redundant variables, those that have the highest mutual 
information with the target. Thus, direct interactions should be better ranked than indirect interactions. 
The MRNET method uses the MRMR principle in the context of network inference. Comparisons with 
ARACNE, relevance networks, and CLR were carried out using synthetically generated data, showing 
that MRNET is competitive with these methods. The R/Bioconductor package minet [86] includes the 
four methods mentioned before, which can be used with four different entropy estimators and several 
validation tools. A known limitation of algorithms based on forward selection — such as MRNET — is 
that their results strongly depend on the first variable selected. To overcome this limitation, an enhanced 
version named MRNETB was presented in [87]; it improves the original method by using a backward 
selection strategy followed by a sequential replacement. 
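
The MRMR ranking described above (relevance minus average redundancy, with forward selection) can be sketched for discrete data as follows. This is an illustrative sketch and not the minet implementation; the helper names, the plug-in entropy estimator, and the toy data (an empirical Markov chain a → b → c, with c as the target) are assumptions.

```python
import math
from collections import Counter

def entropy(*columns):
    """Plug-in joint entropy (bits) of one or more discrete columns."""
    n = len(columns[0])
    counts = Counter(zip(*columns))
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def mutual_info(x, y):
    return entropy(x) + entropy(y) - entropy(x, y)

def mrmr_rank(data, target):
    """Forward selection: relevance to the target minus mean redundancy with ranked inputs."""
    y = data[target]
    remaining = [name for name in data if name != target]
    ranked = []  # list of (name, score), best first
    while remaining:
        chosen = [name for name, _ in ranked]

        def score(name):
            relevance = mutual_info(data[name], y)
            if not chosen:
                return relevance
            redundancy = sum(mutual_info(data[name], data[s]) for s in chosen) / len(chosen)
            return relevance - redundancy

        best = max(remaining, key=score)
        ranked.append((best, score(best)))
        remaining.remove(best)
    return ranked

# Joint counts of (a, b, c) for the hypothetical chain a -> b -> c.
counts = {(0, 0, 0): 9, (0, 0, 1): 3, (0, 1, 1): 3, (0, 1, 0): 1,
          (1, 1, 1): 9, (1, 1, 0): 3, (1, 0, 0): 3, (1, 0, 1): 1}
rows = [state for state, k in counts.items() for _ in range(k)]
data = {name: [row[i] for row in rows] for i, name in enumerate("abc")}

for name, score in mrmr_rank(data, "c"):
    print(name, round(score, 3))
```

The direct regulator b is ranked first; the indirect regulator a is penalized because the information it carries about c is redundant with b. The dependence on the first selected variable, mentioned above as a limitation of forward selection, is visible in the structure of the loop.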

A statistical learning strategy called three-way mutual information (MI3) was presented in [88]. It 
was designed to infer transcriptional regulatory networks from high throughput gene expression data. 
The procedure is in principle sufficiently general to be applied to other reverse engineering problems. 
Consider three variables R_1, R_2, and T, where R_1 and R_2 are possible "regulators" of the target variable, 
T. Then the MI3 metric is defined as 

$$\begin{aligned}
\mathrm{MI3}(T; R_1, R_2) &= 2I(T, (R_1, R_2)) - I(T, R_1) - I(T, R_2) \\
&= 2H(R_1, R_2) + H(R_1, T) + H(R_2, T) - H(R_1) - H(R_2) - 2H(R_1, R_2, T)
\end{aligned} \qquad (23)$$

Both MI3 and ERT try to detect higher order interactions and, for this purpose, they use scores 
calculated from entropies H(*), and 2- and 3-variable joint entropies, H(*,*) and H(*,*,*). MI3 was 
specifically designed to detect cooperative activity between two regulators in transcriptional regulatory 
networks, and it was reported to outperform other methods such as Bayesian networks, two-way mutual 
information and a discrete version of MI3. A method [89] exploiting three-way mutual information and 
CLR was the best scorer in the 2nd conference on Dialogue for Reverse Engineering Assessments and 
Methods (DREAM2) Challenge 5 (unsigned genome-scale network prediction from blinded microarray 
data) [90]. 

A similar measure, averaged three-way mutual information (AMI3), was defined in [91] as 

$$I_{ijk} = I(Y_j; X_i) + I(Y_k; X_i) + I(Y_j; Y_k | X_i) - I(Y_j; Y_k) \qquad (24)$$

where X_i represents the target gene, and Y_j, Y_k are two regulators that may regulate X_i cooperatively. 
The first two terms are the traditional mutual information. The third term represents the cooperative 
activity between Y_j and Y_k, and the fourth term ensures that Y_j and Y_k regulate X_i directly (without 
regulation between Y_j and Y_k): if Y_j regulates X_i indirectly through Y_k, both the third and fourth terms 
will increase, cancelling each other and not leading to an increase in I_ijk. In [91] this score was combined 
with non-linear ordinary differential equation (ODE) modeling for inferring transcriptional networks 
from gene expression, using network-assisted regression. The resulting method was tested with synthetic 
data, reporting better performance than other algorithms (ARACNE, CLR, MRNET and SA-CLR). It 
was also applied to experimental data from E. coli and yeast, yielding new predictions. 

The Inferelator [92] is another freely available method for network inference. It was designed 
for inferring genome-wide transcriptional regulatory interactions, using standard regression and model 
shrinkage techniques to model the expression of a gene or cluster of genes as a function of the levels 
of transcription factors and other influences. Its performance was demonstrated in [93], where the 
transcriptional regulatory network of Halobacterium salinarum NRC-1 was reverse-engineered and its 
responses in 147 experiments were successfully predicted. Although the Inferelator is not itself based 
on mutual information, it has performed best when combined with MI methods [94]. Specifically, it has 
been used jointly with CLR. 

Other methods have relied on correlation measures instead of mutual information for detecting 
indirect interactions. A method to construct approximate undirected dependency graphs from large-scale 
biochemical data using partial correlation coefficients was proposed in [95]. In a first step networks are 
built based on correlations between chemical species. The Pearson correlation coefficient (1) or the 
Spearman correlation coefficient may be chosen. The Spearman correlation coefficient is simply the 
Pearson correlation coefficient between the ranked variables, and measures how well the relationship 
between two variables can be described using a monotonic function. In a second step edges for which 
the partial correlation coefficient (2) falls below a certain threshold are eliminated. This procedure was 
tested both on artificial and on experimental data, and a software implementation was made available at 
the website. 
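
The two-step idea can be illustrated with a short sketch. It assumes that Equation (2) is the usual first-order partial correlation, r_xy.z = (r_xy - r_xz r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)); the function names and the toy data are hypothetical and do not come from [95]. The Spearman coefficient is obtained, as stated above, by applying the Pearson formula to the ranks.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

def partial_corr(x, y, z, corr=pearson):
    """First-order partial correlation of x and y given z."""
    rxy, rxz, ryz = corr(x, y), corr(x, z), corr(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Toy data (an assumption): z drives both x and y; x and y do not interact directly.
z = [1, 2, 3, 4, 5, 6, 7, 8]
x = [zi + e for zi, e in zip(z, [1, -1, 1, -1, 1, -1, 1, -1])]
y = [zi + e for zi, e in zip(z, [1, 1, -1, -1, 1, 1, -1, -1])]

print("r(x, y)     =", round(pearson(x, y), 3))
print("r(x, y | z) =", round(partial_corr(x, y, z), 3))
print("Spearman    =", round(partial_corr(x, y, z, corr=spearman), 3))
```

Here x and y are strongly correlated, so the first step would connect them, but their partial correlation given z is close to zero, so the second step would remove the edge for any reasonable threshold.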

In [96] partial correlation was used to reduce the number of candidate genes. In the partial correlation 
Equation (2) the Pearson correlation coefficient r was replaced by the Spearman correlation coefficient, 
since the latter was found to be more robust for detecting nonlinear relationships between genes. 
However, the authors acknowledged that the issue deserved further investigation. 

In [97] both Pearson and Spearman correlation coefficients were tested; no practical differences 
were found between the two measures. Furthermore, no clear differences were detected between 
linear (correlation) and nonlinear (mutual information) scores. For detecting indirect interactions, 
three different tools were used: partial correlation, conditional mutual information, and the data 
processing inequality, which were found to improve noticeably the performance of their non-conditioned 
counterparts. These results were obtained from artificially generated metabolic data. 

3.3. Detecting Causality 

Inferring the causality of an interaction is a complicated task, with deep theoretical implications. 
This topic has been extensively investigated by Pearl [98]. Philosophical considerations aside, from a 
practical viewpoint we can intuitively assign a causal relation from A to B if A and B are correlated and 
A precedes B. Thus, causal interactions can be inferred if time series data is available. 
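
This intuition can be illustrated with a time-lagged correlation, sketched below for two series in which B simply copies A with a delay of one time step. The sketch is only an illustration of the "correlated and precedes" criterion, not one of the methods reviewed in this section; the function names and the data are assumptions, and a high lagged correlation does not by itself establish causality.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def lagged_corr(source, target, lag=1):
    """Correlation between source at time t and target at time t + lag (lag >= 1)."""
    return pearson(source[:-lag], target[lag:])

# Hypothetical time series: b copies a with a one-step delay.
a = [1, 4, 2, 8, 5, 7, 1, 3, 9, 6, 2, 5]
b = [0] + a[:-1]

forward = lagged_corr(a, b)    # does a precede b?
backward = lagged_corr(b, a)   # does b precede a?
print("a -> b:", round(forward, 3))
print("b -> a:", round(backward, 3))
```

The asymmetry between the two lagged correlations is what suggests a direction for the interaction.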

It was already mentioned that CMC can determine directionality because it takes time series 
information into account, as shown in [57] for a glycolytic path. Another network reconstruction 
algorithm based on correlations was proposed in [99] to deduce directional connections based on 
gene expression measurements. Here the directionality came from the asymmetry of the conditional 
correlation matrix, which expressed the correlation between two genes given that one of them was 
perturbed. Another approach for causal correlations was presented in Opgen-Rhein and Strimmer [100]. 
Once the correlation network is obtained, a partial ordering of the nodes is established by multiple 
testing of the log-ratio of standardized partial variances. In this way a directed acyclic causal network is 
obtained as a subgraph of the original network. This method was validated using gene expression data 
of Arabidopsis thaliana. 

Some methods based on mutual information have taken causality into account. One of them is 
EMC [26], which is essentially the same method as CMC with a different definition of distance. Another 
one is the already mentioned TimeDelay-ARACNE method [77]. The concept of directed information 
described in Section 2 has also been applied to the reconstruction of biological networks. In [101] it 
was used for reconstructing gene networks; the method was validated using small random networks and 
simulated data from the E. coli network for flagella biosynthesis. It was reported that, for acyclic graphs 
with 7 or fewer genes with summation operations only, the method was able to infer all edges. In [102] 
directed information was used for finding interactions between transcription factor modules and target 
co-regulated genes. The validity of the approach was demonstrated using publicly available embryonic 
kidney and T-cell microarray datasets. DTInfer, an R-package for the inference of gene-regulatory 
networks from microarrays using directed information, was presented in [103]. It was tested on 
E. coli data, predicting five novel TF-target gene interactions; one of them was validated experimentally. 
Finally, directed information has also been used in a neuroscience context [104], for inferring causal 
relationships in ensemble neural spike train recordings. 

3.4. Previous Comparisons 

As mentioned in the Introduction, there are some publications where detailed analyses and 
comparisons of some of the methods reviewed here have been carried out. For example, in [17] the 
performance of some popular algorithms was tested under different conditions and on both synthetic and 
real data. Comparisons were twofold: on the one hand, conditional similarity measures like partial 
Pearson correlation (PPC), graphical Gaussian models (GGM), and conditional mutual information 
(CMI) were compared with Pearson correlation (PC) and mutual information (MI); on the other hand, 
linear measures (PC and PPC) were compared with non-linear ones (MI, CMI, and the Data Processing 
Inequality, DPI). 

The differences and similarities of three other network inference algorithms — ARACNE, Context 
Likelihood of Relatedness (CLR), and MRNET — were studied in [35], where the influence of the entropy 
estimator was also taken into account. The performance of the methods was found to be dependent on 
the quality of the data: when complete and accurate measurements were available, the MRNET method 
combined with the Spearman correlation appeared to be the most effective. However, in the case of noisy 
and incomplete data, the best performer was CLR combined with Pearson correlation. 

The same three inference algorithms, together with the Relevance Networks method (RN), were 
compared in [18], using network-based measures in combination with ensemble simulations. In [105] 
Emmert-Streib studied the influence of environmental conditions on the performance of five network 
inference methods, ARACNE, BC3NET [106], CLR, C3NET [107], and MRNET. Comparison of their 
results for three different conditions concluded that different statistical methods lead to comparable but 
condition-specific results. The tutorial [19] evaluated the performance of ARACNE, BANJO, NIR/MNI, 
and hierarchical clustering, using synthetic data. More recently, [20] compared four tools for inferring 
regulatory networks (ARACNE, BANJO, MIKANA, and SiGN-BN), applying them to new microarray 
datasets generated in human endothelial cells. 

4. Conclusions, Successes and Challenges 

A number of methods for inferring the connectivity of cellular networks have been reviewed in this 
article. Most of these methods, which have been published during the last two decades, adopt some 
sort of information theoretic approach for evaluating the probability of the interactions between network 
components. We have tried to review as many techniques as possible, surveying the literature from 
areas such as systems and computational biology, bioinformatics, molecular biology, microbiology, 
biophysics, physical and computational chemistry, physics, systems and process control, computer 
science, or statistics. Some methods were designed for specific purposes (e.g., reverse engineering gene 
regulatory networks), while others aim at a wider range of applications. We have attempted to give a 
unified treatment to methods from different backgrounds, clarifying their differences and similarities. 
When available, comparisons of their performances have been reported. 

It has been shown that information theory provides a solid foundation for developing reverse 
engineering methodologies, as well as a framework to analyze and compare them. Concepts such as 
entropy or mutual information are of general applicability and make no assumptions about the underlying 
systems; for example, they do not require linearity or absence of noise. Furthermore, most information 
theoretic methods are scalable and can be applied to large-scale networks with hundreds or thousands 
of components. This gives them in some cases an advantage over other techniques that have higher 
computational cost, such as Bayesian approaches. 

A conclusion of this review is that no single method outperforms the rest for all problems. There is 
"no free lunch": methods that are carefully tailored to a particular application or dataset may yield better 
results than others when applied to that particular problem, but frequently perform worse when applied 
to different systems. Therefore, when facing a new problem it may be useful to try several methods. 
Interestingly, the results of the DREAM challenges show that community predictions are more reliable 
than individual predictions [108-110]; that is, the best option is to take into account the reconstructions 
provided by all the methods, as opposed to trusting only the best performing ones. 
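
One simple way of combining the reconstructions of several methods is to average the rank that each method assigns to every candidate edge. The sketch below is a hypothetical illustration of this idea and is not claimed to be the exact procedure of [108-110]; the function name `average_rank`, the convention for edges missing from a ranking, and the example rankings are assumptions.

```python
def average_rank(rankings):
    """Combine edge rankings (best first) from several methods by their mean rank.

    An edge absent from a ranking is assigned the rank just after its last entry.
    """
    edges = set().union(*rankings)

    def mean_rank(edge):
        total = sum(r.index(edge) + 1 if edge in r else len(r) + 1 for r in rankings)
        return total / len(rankings)

    # Ties are broken alphabetically so that the result is deterministic.
    return sorted(edges, key=lambda edge: (mean_rank(edge), edge))

# Hypothetical edge rankings produced by three inference methods.
method_1 = ["A-B", "B-C", "A-C"]
method_2 = ["B-C", "A-B", "C-D"]
method_3 = ["A-B", "C-D", "B-C"]

consensus = average_rank([method_1, method_2, method_3])
print(consensus)
```

Edges supported by several methods rise to the top of the consensus list even when no single method ranks them first.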

In the last fifteen years different information theoretic methods have been successfully applied to 
the reverse engineering of genetic networks. The resulting predictions about existing interactions have 
enabled the design of new experiments and the generation of hypotheses that were later confirmed 
experimentally, demonstrating the ability of computational modeling to provide biological insight. 
Another indication of the success of the information theoretic approach is that in recent years methods 
that combine mutual information with other techniques have been among the top performers in the 
DREAM reverse engineering challenges [94]. Success stories have appeared also regarding their 
application to reconstruction of chemical reaction mechanisms. One of the earliest was the validation of 
the CMC method, which was able to infer a significant part of the glycolytic path from experimental data. 

Despite all the advances made in the last decades, the problem faced by these methods (inferring 
large-scale networks with nonlinear interactions from incomplete and noisy data) remains challenging. 
To progress towards that goal, several breakthroughs need to be achieved. A systematic way 
of determining causality that is valid for large-scale systems is still lacking. Computational and 
experimental procedures for identifying feedback loops and other complex structures are also needed. 
For these and other obstacles to be overcome, future developments should take the existing 
methodologies into account and build on their capabilities. We hope that this review will help researchers in that task. 